Abstract

Sliding window is one direct way to extend a successful 
recognition system to handle the more challenging detection 
problem. While action recognition decides only whether or 
not an action is present in a pre-segmented video sequence, 
action detection identifies the time interval where the action occurred in an unsegmented video stream. Sliding window approaches can, however, be slow, as they maximize a classifier score over all possible sub-intervals. Even though new schemes utilize dynamic programming to speed up the search for the optimal sub-interval, they require offline processing on the whole video sequence. In this paper, we propose a novel approach for online action detection based on 3D skeleton sequences extracted from depth data. It identifies the sub-interval with the maximum classifier score in linear time. Furthermore, it is suitable for real-time applications with low latency.

1. Introduction 

Real-time human action detection has become a topic of increasing interest due to its wide practical use. Applications such as human-machine interaction, surveillance, and gaming all require accurate and low-latency action detection. Action detection on raw videos is difficult because a person must first be localized in a scene full of objects and clutter before the type of action being performed can be recognized. On the other hand, recent low-cost depth sensors, like Microsoft Kinect, provide a more convenient way to capture data. The 3D positions of body joints can be estimated from depth maps at low latency and with acceptable accuracy. Because skeleton data filters out background clutter, it is better suited to action detection. Recently, skeleton-based approaches to action recognition and detection have been widely adopted. While action recognition focuses on identifying the action label of pre-segmented video sequences, action detection tackles the more challenging problem of temporally localizing the action in an unsegmented stream of frames.

The main contribution of this paper is a novel approach for action detection from skeleton data, which we refer to as Efficient Linear Search (ELS). We show that combining simple components and specializing them towards skeleton-based action detection can achieve state-of-the-art results and overcome the limitations of similar approaches. The proposed approach is flexible: it can be used with a wide class of classifier functions and with different types of local action descriptors. As a byproduct contribution, we propose a simple skeleton-based local descriptor that, when used in a simple bag-of-features model, produces state-of-the-art results on different datasets. The proposed framework works online and is suitable for real-time applications. Moreover, it can be used for real-time video segmentation, since it specifies both the start and end frames of the action.

The rest of this paper is organized as follows. Section 2 gives an overview of recent related work in the literature. Section 3 presents the action representation we use and our proposed descriptor. Section 4 explains our Efficient Linear Search approach. Section 5 presents the experimental evaluation, and section 6 concludes.

2. Related work 

Forming suitable skeleton-based descriptors for action recognition has been the focus of many recent research works [?]. The objective is to facilitate the recognition task via a discriminative descriptor. Some of these descriptors capture both the pose of the skeleton and its kinematics at the frame level. For example, Nowozin et al. [?] proposed a local descriptor that uses 35 angles between triplets of joints, $\theta(t_0)$, along with angular velocity, $\delta\theta(t_0)$, to encode the joints' kinematics. Joint angles are a powerful cue to skeleton pose. Moreover, they are invariant to body translation and rotation. Later, Zanfir et al. [22] proposed the Moving Pose descriptor, which captures the body pose at one frame, as well as the speed and acceleration of body joints within a short time window centered around the current frame.

Another class of descriptors focuses on computing a fixed-length descriptor for the whole action sequence, like [16, 7, 15, 17]. Gowayyed et al. [6] used a 2D trajectory descriptor, called "Histogram of Oriented Displacements" (HOD), where each displacement in the trajectory casts a vote, weighted by its length, in a histogram of orientation angles. Vemulapalli et al. [?] modeled human actions as curves in a Lie group, since 3D rigid body motions are members of the special Euclidean group. Wang et al. [?] used the 3D joint positions to construct a descriptor of relative positions between joints. However, descriptors computed on the whole sequence have much higher dimensionality than those designed for the frame level. This higher dimensionality sometimes led to the need for feature selection, as done in [12, 14].

While most of the work on skeletal data has addressed action recognition, fewer works have focused on the online problem. The trade-off between latency and accuracy was addressed in recent works [?]. In [?], the notion of action points was first introduced, and the detection problem was cast as a classification problem over every overlapping 35-frame interval. The same notion of action points was utilized in [14], which could additionally handle different scales. Other works, such as [?], used the standard sliding window protocol for online action detection. Zhao et al. [23] proposed a feature extraction method, called "Structured Streaming Skeleton" (SSS), which constructs a feature vector for each frame using a dynamic matching approach. The SSS feature vectors are then used for detecting the start and end of actions. Zanfir et al. [22] used a modified kNN classifier to detect the start and end of actions. However, neither [22] nor [23] can handle multi-scale actions, where the same action can be performed at different speeds.

3. Bag-of-gesturelets for action classification 

The concept of local features first appeared in object detection in images [13]. The main idea is that each object has a set of discriminative local features that, when they appear together, signify the existence of the object. The same concept applies to actions. For a specific action, we can identify a set of key frames that best capture the discriminative poses of this action. However, skeleton poses alone cannot distinguish between some actions, e.g. standing up vs. sitting down, since other information, like the direction and speed of motion, plays an important role in identifying the action. So, differential quantities that describe the joints' kinematics must be included in an action's local features. We then define a gesturelet to be any general local feature that, for any frame, captures both the skeleton pose and kinematic information of body joints at this point in time. In the following, we first introduce our action representation as a bag of gesturelets. Then, we explain our local feature descriptor for representing gesturelets.


3.1. A bag-of-gesturelets representation 


Based on the gesturelets extracted from action sequences, we make use of a bag-of-gesturelets (BoG) representation of human actions. We first extract features at each frame. The resulting descriptors are then clustered to produce a K-entry codebook. We then represent any action sequence or sub-sequence by its cluster histogram, where the histogram counts how many local features from each cluster index have occurred. In order to relax the assignment of each gesturelet to its representative cluster in the codebook, we apply soft binning, letting each gesturelet cast a vote to its m nearest clusters. The vote is weighted according to the distance between the gesturelet and the corresponding cluster, so that closer clusters get higher weight. Implementation details for soft binning are presented in section 5.2.
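
As a rough illustration of the soft-binned histogram described above, the following Python sketch builds a BoG histogram in which each gesturelet votes for its m nearest codebook entries. The function name `soft_bog_histogram`, the use of Euclidean distance, and the array layout are assumptions made for this example; the 1/i vote for the i-th nearest cluster follows the weighting stated in section 5.2.

```python
import numpy as np


def soft_bog_histogram(descriptors, codebook, m=3):
    """Soft-binned bag-of-gesturelets histogram (illustrative sketch).

    descriptors: (n, d) array, one gesturelet descriptor per row.
    codebook:    (K, d) array of cluster centres.
    m:           number of nearest clusters each gesturelet votes for.
    """
    descriptors = np.asarray(descriptors, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    hist = np.zeros(len(codebook))
    # Pairwise Euclidean distances, shape (n, K). Assumed metric.
    dists = np.linalg.norm(
        descriptors[:, None, :] - codebook[None, :, :], axis=2
    )
    # Indices of the m nearest clusters per gesturelet, nearest first.
    nearest = np.argsort(dists, axis=1)[:, :m]
    for rank in range(nearest.shape[1]):
        # The i-th nearest cluster receives a vote of 1/i (i is 1-based).
        np.add.at(hist, nearest[:, rank], 1.0 / (rank + 1))
    return hist
```

Setting `m=1` reduces this to the hard-assignment histogram, which is the case the detection formulation in section 4 assumes for simplicity.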


The BoG representation is necessary for our approach to action detection, as we show in section 4. However, as shown in section 5.3, it is also very effective for action recognition. This BoG representation is independent of the choice of the local descriptor used to represent a gesturelet. Possible descriptors that capture both pose information and joint kinematics are [?].


3.2. Our local descriptor 

The type of local descriptor has a direct impact on recognition performance. We experimented with different descriptors and, as a byproduct contribution, achieved the best results with a proposed local descriptor that is a weighted concatenation of the angles descriptor [?] and a slight modification of the Moving Pose descriptor [22] (footnote 1). The angles descriptor uses 35 angles between triplets of joints, $\theta(t_0)$, along with angular velocity, $\delta\theta(t_0)$. The Moving Pose descriptor, on the other hand, relies on joint positions, $P(t_0)$, relative to a reference joint, namely the hip center. To capture kinematic information, it includes the first- and second-order derivatives, $\delta P(t_0)$ and $\delta^2 P(t_0)$. The final form of our descriptor is $[[P, \alpha\,\delta P, \beta\,\delta^2 P] \mid \psi\,[\theta, \delta\theta]]$, where $\alpha$ and $\beta$ are parameters defined in [22], and $\psi$ is a parameter weighting the relative importance of the two concatenated descriptors. In our experiments in section 5, we show that using our simple descriptor with basic dictionary learning, we can achieve state-of-the-art results over different datasets.
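
The concatenated descriptor can be sketched as follows. This is only an illustration under stated assumptions: the derivatives are approximated with central finite differences over a 5-frame window, the helper name `local_descriptor` and the array shapes are hypothetical, and the joint angles are assumed to be precomputed. The unit-norm rescaling of the position vector follows footnote 1.

```python
import numpy as np


def local_descriptor(P, angles, t, alpha=1.0, beta=1.0, psi=1.0):
    """Sketch of [[P, alpha*dP, beta*d2P] | psi*[theta, dtheta]] at frame t.

    P:      (T, J, 3) joint positions relative to the hip centre.
    angles: (T, A) joint angles per frame (A = 35 in the paper).
    t:      frame index with 2 <= t < T - 2.
    The finite-difference formulas below are assumptions for illustration.
    """
    P = np.asarray(P, dtype=float)
    angles = np.asarray(angles, dtype=float)
    pose = P[t].ravel()
    norm = np.linalg.norm(pose)
    if norm > 0:
        pose = pose / norm  # footnote 1: rescale positions to unit norm
    d_pose = (P[t + 1] - P[t - 1]).ravel()                 # first derivative
    d2_pose = (P[t + 2] + P[t - 2] - 2.0 * P[t]).ravel()   # second derivative
    theta = angles[t]
    d_theta = angles[t + 1] - angles[t - 1]                # angular velocity
    return np.concatenate(
        [pose, alpha * d_pose, beta * d2_pose, psi * theta, psi * d_theta]
    )
```

With J joints and A angles, the resulting vector has 9J + 2A entries, so the descriptor size is fixed per frame regardless of the sequence length.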


1 We modify the descriptor by rescaling the vector of concatenated joint positions to unit norm.





Figure 1: Efficient Linear Search Overview. The goal of the training phase is to learn a weight for each feature cluster. At test time, each extracted feature is weighted by its cluster weight and incorporated in the search for the maximum sub-interval sum. If the sum at time $t_i$ is above a learned threshold $\theta_c$ and starts to decrease, then a detection is triggered for action class $c$.


4. Efficient Linear Search (ELS) 

In this section, we first explain our approach to offline action detection. Then, we show how it can be easily extended to work for online detection as well. Figure 1 gives an overview of the approach.

Although many successful skeleton-based action recognition systems exist, most of them have not been extended to action detection. Sliding window approaches can be used for this task, evaluating the classifier function over all candidate sub-intervals. So, for a video sequence $S$, a sliding window approach identifies the sub-interval, $s_{action} = [t_{st}, t_{end}]$, for which the classifier function produces the maximum score:

$$s_{action} = \arg\max_{s \subseteq S} f(s) \tag{1}$$

This identifies only one occurrence of the target action in the sequence. If multiple occurrences are to be found, then we can simply remove sub-intervals corresponding to previously identified actions and repeat the search for the next $s_{max}$. However, for a sequence of $N$ frames, we have $O(N^2)$ candidate sub-intervals, which incurs significant computational cost. Lampert et al. [8] proposed an efficient branch-and-bound method to search for the optimal bounding box of an object in an image. A limitation of branch-and-bound approaches is that they typically work offline, requiring the whole search space beforehand, as in [8, 21]. In [21], Yuan et al. proposed a direct extension of [8] to action detection from RGB videos. They used a bag-of-features model based on spatio-temporal features, and proposed an offline action detection system. Another limitation of [8, 21] is that optimizing equation (1) may not always produce the desired behavior for a detection procedure, since the optimal interval may contain multiple consecutive instances of an action. This problem is less likely to happen in the case of 2D images, for which the original branch-and-bound approach was first introduced [8].


In the following, we present a specialization of the branch-and-bound approach to the case of skeleton-based action detection. This turns out to be easily cast as a well-known dynamic programming problem that can be solved in linear time. Next, we show that a greedy approximation can effectively address the offline limitation of the branch-and-bound approach, as well as the problem of combining consecutive actions. We assume two conditions: (1) a bag-of-gesturelets representation of action sequences, and (2) a linear binary classifier with good recognition accuracy trained for a specific target action. While the linear classifier constraint is not necessary, as we later show, we will use this assumption for simplicity of explanation.

4.1. Offline action detection 


For any linear classifier, the corresponding scoring and decision functions take the form:

$$f(S) = w^T x + w_0, \qquad y(S) = \begin{cases} 1 & \text{if } f(S) > 0 \\ 0 & \text{otherwise} \end{cases} \tag{2}$$


where $S$ is a test sequence or sub-sequence, $x$ is the feature vector of $S$, $w$ is the weight vector learned by the classifier, and $w_0$ is a constant bias. With the linearity of the dot product, $w^T x$, and the fact that $x$ is a histogram that counts the occurrences of each cluster index, the scoring function (2) can be rewritten as:

$$f(S) = w_0 + \sum_{j=1}^{n} w_{c_j} \tag{3}$$

where $c_j$ is the cluster index to which gesturelet $x_j$ belongs, and $n$ is the total number of gesturelets extracted from sequence $S$ (footnote 2). We can then evaluate the classifier function over

2 For simplicity of presentation, this formulation does not consider soft binning in histogram construction. Including soft binning is straightforward, though.
































[Figure 2 panels: (a) The max subarray ending at the current frame begins at $t_3$; sum = 1 < $\theta_c$. (b) Sum = 8 > $\theta_c$; the action is being performed. (c) The max sum still increases; the action is still ongoing.]

Figure 2: Example of online detection. We assume the action threshold $\theta_c = 7$. The detected interval is $[t_3, t_9]$.


any sub-sequence $s \subseteq S$ by summing the weights, $w_{c_j}$, of only those gesturelets $x_j$ that belong to the sub-sequence $s$. Since we want to identify the sub-sequence $s \subseteq S$ that maximizes equation (3), and the bias is the same for every sub-sequence, we can safely drop the bias term, $w_0$.
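
The equivalence between the histogram form of the score and the per-gesturelet sum can be checked numerically. The sketch below uses made-up weights and cluster assignments, and the two function names are hypothetical; it assumes hard assignment of each gesturelet to one cluster.

```python
import numpy as np


def score_from_histogram(w, w0, cluster_ids):
    """Score computed as w^T x + w0, where x is the cluster histogram."""
    x = np.bincount(cluster_ids, minlength=len(w))
    return float(w @ x + w0)


def score_from_gesturelets(w, w0, cluster_ids):
    """Score computed as w0 plus the sum of per-gesturelet cluster weights."""
    return float(w0 + w[cluster_ids].sum())


# Illustrative values only: three clusters, four gesturelets.
w = np.array([0.5, -1.0, 2.0])
w0 = -0.25
cluster_ids = np.array([2, 0, 2, 1])
```

Because the second form is a sum over gesturelets, restricting it to the gesturelets of a sub-sequence gives that sub-sequence's score directly, which is what makes the per-frame score array in the detection procedure possible.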

The offline detection procedure is then as follows: construct an empty 1D score array, with length equal to the number of frames. This array represents the per-point contribution of all features extracted from test sequence $S$. For each feature $x_j$, identify its cluster index $c_j$, then add its weight, $w_{c_j}$, to the entry of the score array at the feature's frame index. Finding the start and end frames of the action, $s_{action} = [t_{st}, t_{end}]$, that satisfy equation (1) can now be mapped to finding the maximum subarray sum in the score array. With Kadane's algorithm [?], this can be done in time linear in the number of frames, using dynamic programming. If $f(s_{max}) > \theta_c$, where $\theta_c$ is a learned threshold for action class $c$, then we report that action $c$ has occurred, and we return the start and end frames of the sub-sequence $s_{max}$. To learn the action threshold for each action class, we compute the score of all training sequences, and search for the score threshold, $\theta_c$, that minimizes the binary classification error on the training data for class $c$.
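
The offline procedure can be sketched in a few lines of Python. The function names and the input layout (parallel lists of cluster indices and frame indices) are assumptions for this example; the maximum-subarray search is the standard Kadane scan, here extended to track the start and end indices.

```python
def frame_scores(cluster_ids, frame_ids, weights, num_frames):
    """Accumulate each feature's cluster weight at its frame index."""
    scores = [0.0] * num_frames
    for c, t in zip(cluster_ids, frame_ids):
        scores[t] += weights[c]
    return scores


def max_subarray(scores):
    """Kadane's algorithm for a non-empty list.

    Returns (best_sum, start, end) with inclusive frame indices.
    """
    best_sum, best_start, best_end = scores[0], 0, 0
    cur_sum, cur_start = scores[0], 0
    for i in range(1, len(scores)):
        if cur_sum < 0:
            # A negative running sum can only hurt: restart at frame i.
            cur_sum, cur_start = scores[i], i
        else:
            cur_sum += scores[i]
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i
    return best_sum, best_start, best_end


def detect_offline(scores, threshold):
    """Return (start, end) if the best sub-interval beats the threshold."""
    best_sum, start, end = max_subarray(scores)
    return (start, end) if best_sum > threshold else None
```

For example, with per-frame scores `[-1, -2, 3, 4, -1, 2, -5, 1]` and a threshold of 7, the best sub-interval spans frames 2 to 5 with a sum of 8, so a detection is reported; with a threshold of 9 nothing is reported.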

To detect different action classes, we construct a one-versus-all binary classification model for each action class. What differs from one action class to another is the classification model weights assigned to each entry in the codebook, along with the learned threshold $\theta_c$. The detection procedure for each of the classes can then be run concurrently on the action sequence with different per-point feature weights for each action class.

Note that, while the linearity of the classifier is a sufficient condition, it is not necessary. Any type of classification method that can output a classification score at the frame level can be used with our approach, since we later learn the score threshold, $\theta_c$, explicitly. This allows our approach to be used with a wider range of classifiers.
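
The threshold search described in section 4.1 can be sketched as an exhaustive scan over candidate thresholds. The function name `learn_threshold` and the choice of candidates (midpoints between consecutive distinct training scores, plus one value below and one above all scores) are assumptions for this illustration; the paper only states that the threshold minimizing the binary training error is selected.

```python
def learn_threshold(scores, labels):
    """Pick the threshold minimising training error for `score > threshold`.

    scores: sequence scores of the training sequences for class c.
    labels: True where the training sequence belongs to class c.
    """
    distinct = sorted(set(scores))
    # Candidate thresholds (an assumed discretisation of the search).
    candidates = [distinct[0] - 1.0]
    candidates += [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]
    candidates.append(distinct[-1] + 1.0)

    def errors(threshold):
        # Count sequences whose decision disagrees with their label.
        return sum((s > threshold) != y for s, y in zip(scores, labels))

    return min(candidates, key=errors)
```

Since only the ordering of the scores matters here, the same search applies to any classifier that produces frame-level scores, linear or not.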

4.2. Online action detection 

In this section, we extend our approach to handle online action detection. To find the maximum subarray sum, Kadane's algorithm scans through the array. At each position (frame), it identifies the sub-interval ending at this position with the maximum sum. Since we only trigger the action if the maximum sum is larger than a learned threshold $\theta_c$, at any time $t_i$, if the maximum sum ending at frame $i$ exceeds $\theta_c$, we know that action $c$ is being performed, and our task is then to find the frame at which the action will end. We expect that after the action ends, we will encounter a contiguous sequence of frames with negative scores. So, for the next frames, as long as the maximum sum grows, the action is still ongoing. Once the subarray sum starts to decrease, we can assume that the action has ended. There is a compromise between latency and confidence. If we declare the end of the action at the first negative point score, the detection will be very sensitive to noise but will achieve low latency. If we wait until we encounter two consecutive points with negative scores, we achieve higher confidence at the expense of increased latency, and so on. In our experiments, we found that triggering the action at the first frame with a negative score, after exceeding the threshold $\theta_c$, causes a slight decrease in accuracy, while achieving very low latency. Figure 2 shows an example of how online detection works.
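
A minimal sketch of this greedy online procedure for a single action class is given below. The class name `OnlineDetector`, the `patience` parameter (the number of consecutive negative frames that ends an action), and the decision to treat a zero score as non-negative are assumptions made for illustration, not details confirmed by the paper.

```python
class OnlineDetector:
    """Greedy online detector for one action class (illustrative sketch)."""

    def __init__(self, threshold, patience=1):
        self.threshold = threshold  # learned threshold theta_c
        self.patience = patience    # consecutive negatives ending an action
        self.t = 0                  # index of the next incoming frame
        self._restart()

    def _restart(self):
        self.cur_sum = 0.0          # best sum of a sub-interval ending now
        self.cur_start = self.t
        self.active = False         # True once cur_sum exceeded threshold
        self.end = None             # last non-negative frame while active
        self.neg_run = 0

    def push(self, score):
        """Feed one frame score; return (start, end) on a detection."""
        t = self.t
        self.t += 1
        if not self.active:
            # Kadane step: restart when the running sum is not positive.
            if self.cur_sum <= 0:
                self.cur_sum, self.cur_start = score, t
            else:
                self.cur_sum += score
            if self.cur_sum > self.threshold:
                self.active, self.end, self.neg_run = True, t, 0
            return None
        if score < 0:
            self.neg_run += 1
            if self.neg_run >= self.patience:
                detection = (self.cur_start, self.end)
                self._restart()     # search resumes from the next frame
                return detection
        else:
            self.neg_run = 0
            self.end = t            # the action is still ongoing
        return None
```

With `patience=1` the detection is reported one frame after the action's last non-negative frame; larger values trade latency for robustness to isolated negative scores, mirroring the compromise discussed above.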

As we hinted before, a problem with the offline detection procedure is that it may combine consecutive repetitions of the same action into one detection. If an action is repeated twice, with a short pause between the repetitions, we expect that each repetition will yield a high positive response, while the pause between them will have a negative response of smaller magnitude. So, optimizing equation (1) would combine the two repetitions together. For the two repetitions to be separated, the negative score of the pause must be large in magnitude. This problem is solved in online detection by the greedy approach to terminating the action, which only requires consecutive frames with negative scores, without further restrictions on their weight. Section 5.5 compares offline and online detection to highlight this limitation of offline detection.

Detecting multiple classes in the online case is similar to the offline case, except that if one action class is detected at time $t_i$, then the detection procedure for all other classes is reset, and the search for a new detection starts from time $t_{i+1}$.


5. Experimental results 

In this section, we first describe the two datasets used 
in our experiments. We then expand on our implementation 
details. Then, we compare the performance of our approach 
to the state of the art on both datasets. We start with the
performance on the more classical action recognition task,
then we move to online action detection. Next, we compare
offline and online detection. After that, we show how
the performance of our approach is affected by changing 
its main parameters. Finally, we demonstrate its real-time 
performance. 

5.1. Datasets 

MSRC-12: The Microsoft Research Cambridge-12 dataset [5] is a large dataset designed for action detection. It contains more than 700,000 frames in 594 unsegmented sequences, encompassing 6,244 gesture instances, recorded for 30 subjects performing 12 different gestures. The samples consist of the 3D positions of 20 joints of the body skeleton captured using the Microsoft Kinect sensor at 30 fps. The MSRC-12 dataset is annotated using the notion of an action point, which is a pose within the gesture that clearly identifies its completion.


MSR-Action3D: The MSR-Action3D dataset [?] is a standard dataset for action recognition. It consists of 557 pre-segmented sequences, with more than 20,000 frames. There are 10 subjects performing 20 different action gestures. Similar to MSRC-12, the 3D positions of 20 joints are captured using the Microsoft Kinect sensor.

5.2. Implementation details 

For the linear classifier used throughout our experiments, we use an SVM [2] with a linear kernel. The local descriptor has three parameters, $\alpha$, $\beta$, and $\psi$. While $\alpha$ and $\beta$ are inherited from the Moving Pose descriptor [22], we introduced $\psi$ to weight the relative importance of the two concatenated descriptors. Coarse-grained values for the parameters were learned from the training set, using different combinations of the 3 parameters in a brute-force manner. Trial and error was then used to fine-tune the parameter values. For MSR-Action3D, we split the training set, persons {1, 3, 5, 7, 9}, into training and validation sets, using 40% of the training data (persons 7 and 9) as a validation set. The learned parameters ($\alpha = 1$, $\beta = 1$, $\psi = 1.7$) were then fixed for all test experiments on MSR-Action3D. For MSRC-12, the parameters were learned over one modality (20% of the dataset), namely the video modality. The learned values ($\alpha = 0.375$, $\beta = 0.3$, $\psi = 0.2$) were then used in test experiments over all modalities. For soft binning, we set the number of neighbors, $m$, to 3. The vote to the $i$-th nearest cluster is weighted by $1/i$. A control experiment for choosing $m$ is shown in section 5.6.

The unsegmented sequences of the MSRC-12 dataset contain pauses between consecutive action instances, in which the actor often stands still in a neutral pose. Such a neutral pose also occurs at the beginning and end of most action instances. Therefore, the neutral pose does not discriminate between different action classes, and hence the classifier may be tempted to give it a neutral weight, possibly positive in sign. This would cause a problem for our detection procedure, which relies on having negative scores right before and after an action instance. To overcome this problem, we add hard negatives to the negative training samples used to train our binary classifiers. Each of these hard negatives consists of one positive instance followed by a pause between two consecutive instances. In this way, the classifier is forced to give strong negative scores to the neutral poses in order to discriminate between those hard negatives and the positive samples, which solves the issue for the detection procedure. A similar problem occurs in detection on unsegmented sequences from the MSR-Action3D dataset. Actions start and end with neutral poses, which could again cause issues in localizing the beginning and end of action instances. In this case, we include concatenations of two action instances as hard negatives in the binary classifier training.

A weighted moving average was applied to the frame scores, where the anchor frame had a weight equal to the number of its neighbors. For MSRC-12, we used a window of 5 frames, anchored at the middle frame (just as in the local descriptor). For MSR-Action3D, we used a window of 3 frames, since MSR-Action3D sequences are much shorter than those of MSRC-12.
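
One possible reading of this smoothing step is sketched below. The function name `smooth_scores`, the unit weight given to each neighbor, the normalization by the total weight, and the truncation of the window at the sequence boundaries are all assumptions; the paper only specifies the window size and that the anchor weight equals the number of neighbors.

```python
def smooth_scores(scores, window=5):
    """Weighted moving average centred on each frame (illustrative sketch).

    The anchor frame gets a weight equal to the number of its neighbours;
    each neighbour gets weight 1 (assumed). Windows are truncated at the
    ends of the sequence (assumed).
    """
    half = window // 2
    n = len(scores)
    smoothed = []
    for t in range(n):
        lo, hi = max(0, t - half), min(n - 1, t + half)
        neighbours = [scores[k] for k in range(lo, hi + 1) if k != t]
        anchor_weight = len(neighbours)
        if anchor_weight == 0:
            # Single-frame sequence: nothing to average with.
            smoothed.append(float(scores[t]))
            continue
        total = anchor_weight * scores[t] + sum(neighbours)
        # Total weight is anchor_weight (anchor) + anchor_weight (neighbours).
        smoothed.append(total / (2.0 * anchor_weight))
    return smoothed
```

Note that a centred window needs `window // 2` future frames, so in an online setting this smoothing adds that many frames of delay before a frame's score is final.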


5.3. Action recognition 

First, to demonstrate the discriminative power of the BoG representation (section 3.1), we report recognition results on both the MSR-Action3D and MSRC-12 datasets. For MSR-Action3D, we follow the same experimental setup as [?], dividing the dataset into 3 action sets and training with sequences performed by subjects {1, 3, 5, 7, 9}. The average accuracy over the 3 action sets is then reported. We report





Modality        Fothergill et al. [5]   Sharaf et al. [14]   ELS
Video - Text    0.679 ± 0.035           0.713 ± 0.105        0.790 ± 0.133
Image - Text    0.563 ± 0.045           0.656 ± 0.122        0.711 ± 0.228
Text            0.479 ± 0.104           0.521 ± 0.072        0.622 ± 0.246
Video           0.627 ± 0.052           0.635 ± 0.075        0.726 ± 0.225
Image           0.549 ± 0.102           0.596 ± 0.103        0.670 ± 0.254
Overall         0.579                   0.624                0.704

Table 3: Detection experiment for MSRC-12 dataset at $\Delta = 333$ ms latency. The mean F-score and its standard deviation are reported for each instruction modality.


Method                                                                       Accuracy
Eigenjoints [19]                                                             82.3%
Random Occupancy Pattern [16]                                                86.2%
Actionlets Ensemble [17]                                                     88.2%
Covariance Descriptor (Cov3DJ) [7]                                           90.5%
Angles Covariance Descriptor [14]                                            91.1%
Histogram of Oriented Displacements (HOD) [6]                                91.26%
Fusing Spatiotemporal Features [24]                                          94.3%
Group Sparsity and Geometry Constrained Dictionary Learning (DL-GSGC) [10]   97.27%
Our Approach                                                                 96.05 ± 0.39%

Table 1: Comparative recognition results on MSR-Action3D.


Method                    Accuracy
Cov3DJ Descriptor [7]     94.48%
Our Approach              96.83%

Table 2: Recognition results on MSRC-12 dataset.


the average classification rate and its standard deviation over 5 runs with 5 different codebooks. Table 1 compares our results to state-of-the-art approaches. As the results show, the only approach that outperforms ours is DL-GSGC [10], where a dictionary learning algorithm is proposed for sparse coding. DL-GSGC adds group sparsity and geometric constraints to reconstruct feature points with minimal error. In contrast, we achieve very competitive results using basic dictionary learning with simple k-means clustering.

We also report classification results on the MSRC-12 dataset. Since this dataset is originally unsegmented and labeled with action points only, we use the annotation and experimental setup of [7] and compare our results to theirs in table 2. Note that, in [7], the covariance descriptor is used as a global descriptor constructed for the entire action sequence, while we use a local features approach to represent the action sequence. This shows the value of using local pose information in addition to joint kinematics, instead of encoding global kinematic information only as in [7].

5.4. Online action detection 

5.4.1 Action detection on MSRC-12 dataset 

As mentioned before, MSRC-12 is annotated with action points. So, most approaches experimenting on MSRC-12 (e.g. [5, 14]) convert the problem of real-time action detection into classifying each frame as an action point or not, in real time. A detection is counted as positive when the gesture occurrence is detected within a short time window from the ground truth action point. The authors of [7] provided manual annotation for the start and end frames of each gesture instance, which is needed to train classifiers for accurate action segmentation. To be able to compare our results to [5, 14], we regard the gesture occurrence as beginning at the start frame in the annotation of [7] and ending at the action point annotation. Our method should then trigger the occurrence of a gesture at its ground truth action point, as in [5, 14]. Within each modality, we measure the precision and recall of each action class across all 10 folds. We then report the average F-score.

We use two different precision-recall experimental protocols. First, a positive detection is counted if the detection is triggered within 10 frames from the action point, which corresponds to a latency of 0.333 seconds, as in [5, 14]. Table 3 compares the results of ELS against state-of-the-art results on MSRC-12 (footnote 3). Note that we do not compare with [23], as they use a different protocol for F-score calculation.

Second, we use the standard precision-recall experimental protocol used in object detection in images, with an overlap threshold of 0.2, as in [3, 22]. This is to demonstrate the power of our approach for real-time segmentation of the temporal sequence, identifying the start and end of a gesture online and in real time. Since we are the first to report overlap results on MSRC-12, we obtained the state-of-the-art code of Sharaf et al. [14] and reported its overlap results.
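
To make the overlap protocol concrete, the sketch below computes an overlap ratio between a detected and a ground-truth interval and an F-score at a given overlap threshold. It assumes the overlap ratio is the temporal intersection over union of inclusive frame intervals and uses a greedy one-to-one matching; both are assumptions for illustration, and the function names are hypothetical.

```python
def overlap_ratio(det, gt):
    """Temporal intersection over union of two inclusive frame intervals."""
    inter = min(det[1], gt[1]) - max(det[0], gt[0]) + 1
    if inter <= 0:
        return 0.0
    union = (det[1] - det[0] + 1) + (gt[1] - gt[0] + 1) - inter
    return inter / union


def f_score(detections, ground_truth, min_overlap=0.2):
    """F-score where each ground-truth instance matches at most one detection."""
    matched = set()
    true_positives = 0
    for det in detections:
        for k, gt in enumerate(ground_truth):
            if k not in matched and overlap_ratio(det, gt) >= min_overlap:
                matched.add(k)
                true_positives += 1
                break
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(detections)
    recall = true_positives / len(ground_truth)
    return 2 * precision * recall / (precision + recall)
```

Raising `min_overlap` from 0.2 to 0.5 makes the criterion stricter, since a detection must then cover the ground-truth interval much more tightly to count as correct.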

3 Before the camera-ready version, we found [?], which reports better results on MSRC-12.































Modality        Sharaf et al. [14]   ELS               ELS
                at 0.2 overlap       at 0.2 overlap    at 0.5 overlap
Video - Text    0.684 ± 0.074        0.921 ± 0.126     0.866 ± 0.146
Image - Text    0.687 ± 0.099        0.894 ± 0.085     0.806 ± 0.099
Text            0.558 ± 0.092        0.788 ± 0.139     0.710 ± 0.158
Video           0.669 ± 0.082        0.895 ± 0.068     0.821 ± 0.093
Image           0.598 ± 0.082        0.858 ± 0.086     0.734 ± 0.130
Overall         0.639                0.871             0.787

Table 4: Overlap detection experiment for MSRC-12 dataset at 0.2 and 0.5 overlap ratios. The mean F-score and its standard deviation are reported for each instruction modality.


We compare results in table 4. The results show that our approach significantly outperforms the state-of-the-art results in the overlap experiment. Our results for a 0.5 overlap ratio are even better than those of Sharaf et al. with a 0.2 overlap ratio. This emphasizes the power of our approach for real-time action segmentation. Note that although [14] uses a multi-scale approach, it operates on the level of the whole action sequence. In contrast, the granularity of our approach is a single frame with a small temporal window of 2 frames on each side. Such finer granularity gives higher flexibility when identifying the start and end of actions of different lengths and/or speeds.

5.4.2 Action detection on MSR-Action3D dataset 

To further test our approach on another dataset, we conduct the same experiment as [22] for online action detection on MSR-Action3D. Since MSR-Action3D is designed for action recognition, where action sequences are pre-segmented, [22] concatenates all test sequences in a random order to create one long unsegmented test sequence for action detection. To be able to compare our results to [22], we train with persons {1, 2, 3, 4, 5} and test with the rest. We also repeat this experiment 100 times with different random concatenation orderings, as done in [22], and compare our results in table 5. As the results show, we are on par with state-of-the-art results on this dataset. Note, however, that a limitation of [22] is its use of k-NN search over all training frames. This works well for small datasets, like MSR-Action3D. However, for large datasets, k-NN will potentially incur significant space and time requirements. In contrast, the complexity of our approach is primarily affected by the codebook size, which is significantly smaller than the number of frames in the training data.

5.5. Offline vs online detection 

Table 6 compares offline and online detection results on
both MSRC-12 and MSR-Action3D datasets. It reports the
detection F-score for a 0.2 overlap ratio. The results highlight
the limitations of offline detection and how the online

Method                  Detection mean AP
The Moving Pose [22]    0.890 ± 0.002
ELS                     0.902 ± 0.007

Table 5: Detection experiment for MSR-Action3D at a 0.2 overlap ratio.

detection overcomes them, as discussed in section 4.2. Since the unsegmented sequences in MSRC-12 are mainly repeated performances of the same action class, the offline detection procedure at first concatenated all repetitions into one long detection. Although hard negatives (section 5.2) partially solved this problem, around 85% of the gestures that failed to be detected were still missed due to concatenation with subsequent actions. For MSR-Action3D, around 64% of the actions that failed to be detected were missed due to partial or complete concatenation with subsequent actions. For example, in the synthesized unsegmented test sequence, there were 14 occurrences where the same action was repeated twice. While the online detection successfully separated and detected the two repeated actions in 13 out of 14 occurrences, the offline detection failed to separate the two repeated actions in all 14 occurrences.


Dataset         Offline F-score    Online F-score
MSRC-12         0.524              0.871
MSR-Action3D    0.794              0.930

Table 6: Offline and online detection F-score, for a 0.2 overlap
ratio.
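
To make the metric concrete, the sketch below computes a detection F-score at a given overlap ratio and shows why merging two repetitions into one long detection costs recall. It rests on assumptions that are not stated in this section: the overlap ratio is taken to be intersection over union of frame intervals, each ground-truth action may be matched at most once, and the names `overlap_ratio` and `detection_f_score` are illustrative.

```python
def overlap_ratio(a, b):
    """Intersection over union of two inclusive frame intervals (start, end)."""
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union

def detection_f_score(detections, ground_truth, threshold=0.2):
    """F-score for lists of (label, start, end) detections and annotations."""
    matched = set()  # indices of ground-truth actions already claimed
    true_pos = 0
    for label, start, end in detections:
        for i, (gt_label, gt_start, gt_end) in enumerate(ground_truth):
            if i in matched or gt_label != label:
                continue
            if overlap_ratio((start, end), (gt_start, gt_end)) >= threshold:
                matched.add(i)
                true_pos += 1
                break
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(detections)
    recall = true_pos / len(ground_truth)
    return 2 * precision * recall / (precision + recall)

# Two repetitions of the same gesture (hypothetical frame numbers).
truth = [("wave", 0, 49), ("wave", 60, 109)]
merged = [("wave", 0, 109)]                     # one long detection
separate = [("wave", 2, 47), ("wave", 63, 105)]  # two separate detections
```

With these toy intervals, the merged detection matches only one of the two ground-truth actions, so its recall is 0.5 even though its precision is 1, while the two separate detections match both.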

5.6. Sensitivity analysis 

In this section we show the sensitivity of our
framework’s performance to changing different parameters. We
conduct these experiments on the MSR-Action3D dataset
since it is of reasonable size, which allows extensive
experimentation. Without loss of generality, we conduct these
experiments on the more basic action recognition task. Since
our results depend on the constructed codebook, whose
construction involves randomness, we repeat each experiment
5 times using 5 different codebooks and report the mean
result and its standard deviation.

Figure 3: Control experiments. (a) Recognition results vs.
codebook size for MSR-Action3D. (b) Recognition results vs.
number of neighbors, m, for soft binning.

The first parameter is the choice of codebook size, K.
The suitable codebook size is expected to be proportional to
the target dataset size and the number of gesture classes in
it. Figure 3a shows recognition results on MSR-Action3D
with different codebook sizes. The results show that our
framework is not sensitive to the choice of codebook size:
it works well even with small codebooks. Also,
the small standard deviation means that the clustering
randomness in codebook construction has only a minor effect.
This indicates both the robustness of our approach and the
discriminative power of our local descriptor.

Next, we show the effect of using soft binning [?] with
different choices of m, where m is the number of neighbors
for which each feature casts a weighted vote. For this
experiment, we fix the codebook size to 2500 and plot the
recognition result against different settings of m in Figure 3b.
Although recognition accuracy improves when soft binning is
used, the more important effect is how the standard
deviation decreases significantly as m increases.
This is because, with soft binning, a feature is less likely to
miss voting for the correct representative cluster than with
hard assignment to a single cluster.
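
Soft binning as described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions: the vote weights are inverse-distance weights normalized per descriptor, which is one common choice and not necessarily the weighting used in the paper, and the function name `soft_bin_histogram` and the random toy data are hypothetical.

```python
import numpy as np

def soft_bin_histogram(descriptors, codebook, m=3):
    """Bag-of-words histogram where each descriptor votes for its m nearest codewords.

    descriptors: (N, D) array; codebook: (K, D) array.
    Weights are inverse-distance, normalized to sum to 1 per descriptor
    (an assumed weighting, chosen for illustration only).
    """
    # Pairwise Euclidean distances, shape (N, K).
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :m]            # (N, m) codeword indices
    near_d = np.take_along_axis(dists, nearest, axis=1)   # (N, m) distances
    weights = 1.0 / (near_d + 1e-8)
    weights /= weights.sum(axis=1, keepdims=True)
    hist = np.zeros(codebook.shape[0])
    np.add.at(hist, nearest, weights)  # accumulates repeated indices correctly
    return hist

rng = np.random.default_rng(0)
codebook = rng.normal(size=(20, 8))      # K = 20 toy codewords
descriptors = rng.normal(size=(50, 8))   # 50 toy frame descriptors
hard = soft_bin_histogram(descriptors, codebook, m=1)  # hard assignment
soft = soft_bin_histogram(descriptors, codebook, m=3)  # soft binning
```

Setting m = 1 reduces to hard assignment. With m > 1, a descriptor that falls near a cluster boundary still contributes to the neighboring codewords, which is consistent with the lower variance across codebooks reported in the text.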

Last, we show the effect of using different local
descriptors with our approach in Table 7. We achieve the best
results using our proposed descriptor (Section 3.2).


5.7. Real-time operation 


The goal of our framework is not only to perform online
action detection, but also to do so in real time. Three
main factors affect the running time of our approach:
1) the size of the codebook; 2) the size of the local
descriptor; 3) the number of neighbors, m, to vote for in soft
binning. For this experiment, we used a large codebook of
size 3000. The dimensionality of our descriptor is 250, and
we used m = 3.


Used descriptor            Accuracy
Angles Covariance [?]      83.93%
Angles Descriptor [?]      86.25%
The Moving Pose [22]       94.01%
Our proposed descriptor    96.05%

Table 7: Effect of using different features in our approach on
MSR-Action3D recognition.


The average running time per frame of our MATLAB
implementation was measured to be 10.7 ms, so our framework
can process approximately 93 frames per second. The
running time was measured on a machine with a 2.2 GHz
Intel quad-core Core i7 processor and 12 GB of RAM.
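
The arithmetic behind these figures, and the per-frame operation that the three factors control, can be sketched as follows. This is a rough illustration in Python and not the MATLAB implementation measured above: the brute-force nearest-codeword search, the helper name `nearest_codewords`, and the random stand-in codebook are assumptions.

```python
import numpy as np

K, D, M = 3000, 250, 3     # codebook size, descriptor size, neighbors (from the text)
MS_PER_FRAME = 10.7        # reported average running time per frame

frames_per_second = 1000.0 / MS_PER_FRAME
multiply_adds_per_frame = K * D  # one brute-force pass over all codewords

def nearest_codewords(descriptor, codebook, m):
    """Indices of the m codewords closest to one frame descriptor."""
    dists = np.linalg.norm(codebook - descriptor, axis=1)
    # argpartition isolates the m smallest distances in O(K); sort only those m.
    candidates = np.argpartition(dists, m)[:m]
    return candidates[np.argsort(dists[candidates])]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(K, D))   # random stand-in for a learned codebook
descriptor = rng.normal(size=D)      # random stand-in for one frame descriptor
neighbors = nearest_codewords(descriptor, codebook, M)

print(f"{frames_per_second:.1f} frames/s, "
      f"{multiply_adds_per_frame} multiply-adds per frame")
```

Under this brute-force assumption the per-frame cost grows with the product of codebook size and descriptor size, which is why those two factors dominate the running time, while the number of neighbors only affects the final selection step.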

6. Conclusion 

We have proposed both a simple skeleton-based
descriptor and a novel approach for action detection. The
proposed approach maximizes a binary classifier score over all
possible sub-sequences, typically in linear time. It can be
used in conjunction with a large class of classifiers and with
any local descriptor. Our approach works online and in
real time with low latency. It detects a gesture by specifying
its start and end frames in an unsegmented video sequence,
which makes it suitable for real-time temporal segmentation
of video. While the proposed method relies on simple
components, we showed that a specialization for
skeleton-based action detection can be established which not
only outperforms the state of the art, but also overcomes the
limitations of similar approaches.
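
The idea of maximizing a score over all sub-sequences in a single pass can be illustrated with the classic maximum-sum sub-interval scan (Kadane's algorithm). This is a generic sketch, assuming the classifier score of an interval decomposes into a sum of per-frame scores. It is not presented as the exact procedure of this paper, and the function name `max_score_interval` and the toy scores are hypothetical.

```python
def max_score_interval(frame_scores):
    """Return (best_sum, start, end) of the contiguous run with the largest sum.

    Single pass, O(n) time and O(1) memory, so frames can be consumed as
    they arrive. Indices are inclusive.
    """
    best_sum, best_start, best_end = float("-inf"), 0, 0
    run_sum, run_start = 0.0, 0
    for t, score in enumerate(frame_scores):
        if run_sum <= 0:
            # A non-positive running sum cannot help; start a new run here.
            run_sum, run_start = score, t
        else:
            run_sum += score
        if run_sum > best_sum:
            best_sum, best_start, best_end = run_sum, run_start, t
    return best_sum, best_start, best_end

# Hypothetical per-frame classifier scores.
scores = [-1.0, 2.0, 3.0, -1.0, 4.0, -5.0, 1.0]
print(max_score_interval(scores))  # (8.0, 1, 4)
```

Because the scan keeps only the current run and the best run so far, its state can be updated frame by frame, which is the property that makes a linear-time search compatible with online operation.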